Capstone Project

Group - Capstone NLP - 2 Group - 2

Mentor - Mr. Rohit Raj

Members

  1. Ajay Kumar
  2. John Cherian
  3. Kiran Bobba
  4. Mario Mathew
  5. Shankhadeep

Part 1

Summary of problem statement, data and findings

1.1 Problem Statement

For industries around the world, accidents in the workplace are a major concern, since they affect the lives and well-being of employees, contractors and their families, and the industry faces losses in terms of hospital charges, litigation fees, reputation and lost employee morale. Based on these facts, it is intended to build a chatbot that can highlight the safety risk, as per the incident description, to professionals including:

1. Personnel from the safety and compliance team

2. Senior management from the plant

3. Personnel from other plants across the globe

4. Government and industrial safety groups

5. Anyone interested in or doing research on industrial safety

6. Emergency health and safety teams

7. Fire safety and industrial hazard teams

8. General management

9. Other personnel requiring safety risk information

so that these professionals can:

1. Take preventive and proactive measures based on past history

2. React faster to employee satisfaction issues related to safety

3. Help position equipment and machinery in a safe place where the risk of potential accidents can be minimised

4. Gain insights about safety in industries where safety is paramount

5. Reduce insurance costs by better handling of personnel, equipment and other resources

6. Take other safety-related decisions and actions

1.2 Outcome

The user should be able to input an incident description, and the chatbot should be able to predict the potential accident or vulnerability level. This can be extended or configured for different scenarios.

1.3 The Data

The dataset describes accident incidents from twelve different plants across three different countries and consists of 425 records. It has the following columns:

Date: timestamp or time/date information

Countries: The country in which the accident occurred (anonymised)

Local: The city where the manufacturing plant is located (anonymised)

Industry sector: Which sector the plant belongs to

Accident level: From I to VI, it registers how severe the accident was (I means not severe, VI means very severe)

Potential Accident Level: From I to VI, depending on the Accident Level, the database also registers how severe the accident could have been (due to other factors involved in the accident)

Gender: Whether the person involved is male or female

Employee or Third Party: If the injured person is an employee or a third party / contractor

Critical Risk: Description of the risk involved in the accident

Description: Detailed description of how the accident happened

1.4 Summary of findings and implications

On inspection of the dataset it appears that:

1. The dataset is limited, consisting of only 425 records, so training models to high accuracy could be a challenge

2. The dataset is imbalanced on certain variables, such as potential accident level and accident level. This means that we may not get consistent results unless the dataset is treated to reduce the imbalance

3. Minor accidents are more common than major accidents, which resembles real-world situations

4. There is data from three countries

5. There are twelve locals, or cities, from which the data is taken

6. There are three industry sectors: Mining, Metals, and a third, Others, which groups all remaining sectors

7. There are five accident levels

8. There are six potential accident levels

9. Employees, third parties and remote third parties are involved in the accidents

10. There are thirty-three different types of critical risk, one of which is assigned to each accident incident

11. The accident description is highly unclean, so it will require a considerable amount of cleaning effort before it can produce results

12. The dataset consists of data from January 2016 to July 2017

13. Males are involved in accidents more often than females. This too resembles real-world situations, as considerably fewer females work in industrial environments

Part 2

2.1 Summary of the Approach to EDA and Pre-processing

Approach - We have agreed to design a chatbot that uses Slack as the user interface, integrated with RASA and an API that triggers the underlying NLP model being built

We have agreed on intermediate goals and progressed through the process steps below

As part of building the NLP model we have adopted the process steps below

Data processing techniques:

1. Data cleansing

2. Feature engineering

3. Lemmatizing and stemming

4. Removing stop words

5. GloVe embedding

Data visualization - Charts that show clearly how the data is spread across different dimensions, using univariate, bivariate and multivariate analysis

Model designing - As part of model designing we have designed and trained the below models:

1. Random Forest

2. Gradient Boosting

3. Logistic Regression

4. SVM

and neural network classifiers such as:

1. RNN

2. LSTM and bi-directional LSTM

3. FastText

We are fine-tuning and evaluating these models so that the best-performing one can be shipped behind the API that is triggered from the Slack user interface

Findings - From the data analysis we could infer that:

1. Many body-related actions and accidents have been found

2. A lot of equipment-related accidents are cited in the dataset

3. Features are poor, with low-quality or inadequate data resulting in class imbalance

Since the data shows that the recorded Accident Level is often low even where the potential severity is critical, we will have to consider both Accident Level and Potential Accident Level for the model prediction

Data Preparation for Time Series Analysis
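The time-based findings in this report (years and quarters) rely on features derived from the Date column. A minimal sketch of that derivation follows, assuming the dates are stored as `YYYY-MM-DD HH:MM:SS` strings; the format, the sample dates and the helper name `date_features` are assumptions made for illustration, not taken from the project code.

```python
from datetime import datetime


def date_features(raw: str) -> dict:
    """Split a date string into year, month and calendar quarter.

    Assumes the first 10 characters are in YYYY-MM-DD form.
    """
    parsed = datetime.strptime(raw[:10], "%Y-%m-%d")
    quarter = (parsed.month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
    return {"year": parsed.year, "month": parsed.month, "quarter": quarter}


# Hypothetical sample dates inside the January 2016 to July 2017 range
for raw in ["2016-01-01 00:00:00", "2017-07-09 00:00:00"]:
    print(raw, date_features(raw))
```

Because the dataset covers January 2016 to July 2017, the first and second quarters occur in both years while the fourth occurs only once, which is worth keeping in mind when comparing quarterly counts.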

Replacing categorical values
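One way to replace the categorical severity values is to map the Roman numerals I to VI onto integers, so that the levels can be counted, sorted and compared. The sketch below assumes exactly that encoding; the mapping name, the helper `encode_level` and the sample records are hypothetical.

```python
# Roman-numeral severity levels as described in section 1.3
ROMAN_TO_INT = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6}


def encode_level(level: str) -> int:
    """Convert a Roman-numeral level such as 'IV' to its integer value."""
    return ROMAN_TO_INT[level.strip().upper()]


# Hypothetical records using the column names from the dataset description
records = [
    {"Accident Level": "I", "Potential Accident Level": "IV"},
    {"Accident Level": "II", "Potential Accident Level": "V"},
]
encoded = [
    {column: encode_level(value) for column, value in record.items()}
    for record in records
]
print(encoded)
```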

Exploratory Data Analysis

Univariate Analysis

Country_01 is the most affected country with 251 accidents; Country_03 is the least affected with 44 accidents.
Local_3 is the most affected city with 90 accidents and belongs to Country_01. Local_11 is the least affected city with 2 accidents and belongs to Country_01. Local_09 and Local_12 are also among the least affected cities, with 2 and 4 accidents respectively, and belong to Country_02.
Most accidents happened in the Mining industry sector, which accounts for 241 accidents.
Accident Level 1 is the most frequent accident level and Accident Level 5 is the least frequent in the dataset.
Potential Accident Level 4 is the most frequent, with 143 accidents. Potential Accident Level 5 is the least frequent, with 32 accidents.
Males are the most affected gender and females the least, with 403 and 22 accidents respectively.
Third Party is the most affected employee type with 189 accidents. Third Party (Remote) is the least affected with 57 accidents.
Most Critical Risk values fall in the Others class, which accounts for 232 accidents. We attribute this to the fact that, in real life, the details of most accidents are not disclosed.
The first quarter is the most affected quarter with 154 accidents and the fourth quarter is the least affected with 58 accidents.
Country_01 accounts for 59% of accidents and Country_03 for 10%.
The Mining sector accounts for 57% of the total accidents. Others is the least affected industry sector, accounting for 12% of the total accidents.
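The percentages quoted above follow directly from the counts and the dataset size of 425 records. A short worked check is sketched below; the dictionary layout and the helper name `share` are illustrative only.

```python
TOTAL_RECORDS = 425  # dataset size stated in section 1.3

# Counts quoted in the univariate analysis
counts = {
    "Country_01": 251,
    "Country_03": 44,
    "Mining": 241,
    "Male": 403,
    "Female": 22,
}


def share(count: int, total: int = TOTAL_RECORDS) -> int:
    """Return the percentage share of `count` in `total`, rounded to an integer."""
    return round(100 * count / total)


for name, count in counts.items():
    print(f"{name}: {count} accidents, {share(count)}% of the dataset")
```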

Bivariate and Multivariate Analysis

Country_01 is the most affected country, and most of the classes of Potential Accident Level belong to Country_01.

Country_01 is the most affected country, and most of the classes of Accident Level belong to Country_01.

The Mining sector is the most affected, and the more severe accidents also belong to this sector.

The first and second quarters account for the higher Accident Levels, 4 and 5.

Most of the classes of Potential Accident Level come from the Others class of Critical Risk, which has 232 records.

The more severe Potential Accident Levels come from the Critical Risk classes Fall, Electrical installation, Vehicles, Projection, Pressed and Mobile equipment.

The Mining sector is the most affected sector, and most of the classes of Critical Risk come from this sector.

Accident Level vs Potential Accident Level

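    import plotly.express as px  # ds is assumed to be the accident DataFrame loaded earlier

    # Count accidents per Accident Level, coloured by Potential Accident Level
    fig = px.histogram(ds, color='Potential Accident Level', x='Accident Level', width=800, height=500)
    fig.update_layout(bargap=0.2)  # leave a gap between bars
    fig.show()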

Class 1 of the Accident Level accounts for most of the accidents and spans all the classes of Potential Accident Level, that is 1, 2, 3, 4 and 5.

Third Party and Employee are the most affected employee types.

Males are the most affected gender, with Potential Accident Levels 4 and 5, which come from the Mining sector.

Local_3 is the most affected city, and its most affected employee types are Third Party and Employee.

Local 3 has the highest number of Mining industry sector accidents.

Local 5 has the highest number of Metals industry sector accidents.

All the Mining industry sector accidents happened in Local 1, 2, 3, 4 and 7.

All the Metals industry sector accidents happened in Local 5, 6, 8 and 9.

All the Others industry sector accidents happened in Local 10, 11 and 12.

Most of the accidents happened in 2016, with fewer in 2017.

Most of the Mining accidents happened in 2016, with fewer in 2017.

Part 3

3.1 Deciding Models and Model Building

Design, train and test with various classifiers

Text Data Cleaning

Analysing the text variable
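The text cleaning step can be sketched as follows: lower-case the description, strip digits and punctuation, and drop stop words. This is only an illustration using the standard library. The stop-word list is a deliberately tiny, hypothetical one, the sample sentence is invented, and lemmatizing and stemming are not shown.

```python
import re
from typing import List

# Hypothetical, deliberately small stop-word list; a real run would use a full list
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "on", "the", "to", "was", "while"}


def clean_description(text: str) -> List[str]:
    """Return the lower-cased word tokens of `text` without stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replace digits and punctuation with spaces
    tokens = text.split()
    # Drop stop words and single-letter leftovers such as the "s" from "mechanic's"
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]


sample = "While removing the drill rod, the mechanic's hand was hit at 10:30."
print(clean_description(sample))
```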

Design, train and test machine learning classifiers

Random Forest model Training and Evaluation

Gradient Boosting model for Training and Evaluation

Logistic Regression model for Training and Evaluation

Linear SVC model for Training and Evaluation

Design, train and test Neural networks classifiers

Pad Sequences - for Train and Test
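Padding brings every tokenised description to the same length so that the sequences can be batched. The sketch below is a plain-Python illustration that pads and truncates at the front, which matches the default behaviour of the Keras `pad_sequences` utility; whether the project used that utility, and the helper name `pad_token_ids`, are assumptions.

```python
def pad_token_ids(sequences, maxlen, value=0):
    """Left-pad or left-truncate each list of token ids to exactly `maxlen` items.

    Assumes maxlen >= 1. The id `value` (0 here) is reserved for padding.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last `maxlen` ids
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


train_ids = [[5, 2], [7, 1, 4, 9, 3]]  # hypothetical token ids
print(pad_token_ids(train_ids, maxlen=4))
```

The same `maxlen` would be applied to both the train and the test sequences so that they share one input shape.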

Create a weight matrix using GloVe embeddings
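A weight matrix has one row per vocabulary index, filled with the GloVe vector of the corresponding word. The sketch below assumes the usual GloVe text format (a word followed by space-separated floats on each line) and a word-to-index mapping that starts at 1, with row 0 reserved for padding. The tiny three-dimensional vectors and all names are hypothetical.

```python
def build_weight_matrix(glove_lines, word_index, dim):
    """Build a (vocabulary size + 1) x dim matrix of GloVe vectors as nested lists."""
    vectors = {}
    for line in glove_lines:
        parts = line.split()
        vectors[parts[0]] = [float(x) for x in parts[1:]]

    # Row 0 stays all zeros for the padding id; unknown words also stay zero
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix


glove_lines = ["hand 0.1 0.2 0.3", "drill 0.4 0.5 0.6"]  # hypothetical 3-d vectors
word_index = {"hand": 1, "drill": 2, "rod": 3}           # "rod" has no vector here
weights = build_weight_matrix(glove_lines, word_index, dim=3)
print(weights)
```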

Embedding Layer gives us 3D output -> [Batch_Size , Review Length , Embedding_Size]
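The three-dimensional output arises because an embedding layer is, in effect, a row lookup: every token id in every sequence is replaced by its vector. A plain-Python illustration of that lookup and the resulting shape follows; the values and the helper name `embed_batch` are hypothetical.

```python
def embed_batch(batch, weight_matrix):
    """Replace every token id in the batch with its row of the weight matrix."""
    return [[weight_matrix[token_id] for token_id in seq] for seq in batch]


weight_matrix = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]  # 3 ids, embedding size 2
batch = [[0, 1, 2], [2, 2, 1]]                        # batch size 2, length 3

output = embed_batch(batch, weight_matrix)
shape = (len(output), len(output[0]), len(output[0][0]))
print(shape)  # (batch size, sequence length, embedding size)
```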

Design, train and test RNN or LSTM classifiers

FastText Model

Hyperparameter tuning covers epochs, learning rate, word n-grams, hierarchical softmax and multi-label classification (the last was only tried briefly)

Training with 300 epochs for both Accident Level and Potential Accident Level

With WordNGrams

Setting the word n-grams value above 1 decreased the accuracy, so this hyperparameter brought no improvement

With hierarchical softmax - Adding this hyperparameter reduced the accuracy, so it was not kept

Multilabel classification
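A minimal sketch of the FastText setup described above follows. fastText's supervised mode reads one example per line, with each label prefixed by `__label__`. The helper `to_fasttext_line` builds such lines, and `train_model` shows where the hyperparameters from this section would go (`epoch`, `lr`, `wordNgrams`, and `loss="hs"` for hierarchical softmax or `loss="ova"` for multi-label). The function names, the learning rate default and the sample text are assumptions; only the 300 epochs come from this report.

```python
def to_fasttext_line(labels, text):
    """Format one example as '__label__A __label__B some text' on a single line."""
    prefix = " ".join(f"__label__{label}" for label in labels)
    body = " ".join(text.split())  # collapse newlines and repeated spaces
    return f"{prefix} {body}"


def train_model(path, epoch=300, lr=0.5, word_ngrams=1, loss="softmax"):
    """Train a supervised fastText model on a file of formatted lines.

    Requires the third-party `fasttext` package, imported here so that the
    formatting helper above works without it. lr=0.5 is an illustrative value.
    """
    import fasttext

    return fasttext.train_supervised(
        input=path, epoch=epoch, lr=lr, wordNgrams=word_ngrams, loss=loss
    )


# Single-label example (Accident Level) and a multi-label example
print(to_fasttext_line(["I"], "hit  by rod\n"))
print(to_fasttext_line(["I", "IV"], "hit by rod"))
```

For the multi-label experiment, each line would carry more than one `__label__` prefix, as in the second example.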

Closing Remarks:

We do not have sufficient labeled text data: the dataset contains only 425 records. Training a neural network requires large datasets because the network contains a huge number of parameters. Hence, training such networks on limited data will often lead to overfitting and low accuracy.


SMOTE:

Implementing SMOTE led to high accuracy, with the caveat of overfitting.
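For reference, the core idea of SMOTE is to create each synthetic minority sample by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below is a simplified plain-Python illustration of that idea, not the library implementation used in the project; the function name, the sample points and the parameter defaults are hypothetical.

```python
import random


def smote_samples(minority, n_new, k=2, seed=0):
    """Generate `n_new` synthetic points from a list of minority feature vectors.

    Assumes at least two minority points. Each synthetic point lies on the
    line segment between a random minority point and one of its k nearest
    minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Sort the remaining points by squared Euclidean distance to `base`
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


minority_points = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]]  # hypothetical features
new_points = smote_samples(minority_points, n_new=4)
print(new_points)
```

A common precaution is to oversample only the training split, so that the test split contains original records only and still reflects how the model behaves on unseen data.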

Random Forest model Training and Evaluation:

Train Accuracy of the Random Forest model : 99.82 , Test Accuracy of the Random Forest model : 74.82


Gradient Boosting model for Training and Evaluation:

Train accuracy of the Gradient boosting model : 99.82 , Test accuracy of the Gradient boosting model : 65.47


Logistic Regression model for Training and Evaluation:

Train accuracy of the LR model : 99.82 , Test accuracy of the LR model : 70.50


Linear SVC model for Training and Evaluation:

Train accuracy of the SVC model : 99.82 , Test accuracy of the SVC model : 72.66
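The overfitting noted above can be read off the gap between train and test accuracy for each model. A short worked calculation using the figures reported in this section:

```python
# (train accuracy, test accuracy) as reported above
results = {
    "Random Forest": (99.82, 74.82),
    "Gradient Boosting": (99.82, 65.47),
    "Logistic Regression": (99.82, 70.50),
    "Linear SVC": (99.82, 72.66),
}

# Gap between train and test accuracy, in percentage points
gaps = {name: round(train - test, 2) for name, (train, test) in results.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: train-test gap of {gap:.2f} points")
```

On these numbers Random Forest has the smallest gap and Gradient Boosting the largest, although every model loses at least 25 points between train and test.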


[Figure: classification_report output]


BERT

While implementing BERT we encountered issues such as package incompatibilities, version conflicts and unstable methods in the source files. In view of these issues and the limited time, we have parked BERT for future research.

We are also aware that NLP models are typically shallower and thus require different fine-tuning methods.


FastText

FastText automatic hyperparameter tuning requires systems with high computing capability; a system with 16 GB of RAM was not sufficient to support its computation. Upon reaching out to community support, we learned that a few wrappers written in C++ were not stable (for instance, the autotune.cc wrapper file).

Owing to this challenge, we have parked automatic hyperparameter tuning for future research.